We will look at the backpropagation algorithm in the context of feedforward, layered networks (Figure 1), which have found broad application in the literature. These networks have one layer of input ports, the points of application of the input vectors. From each input port, connections radiate toward the next level, the first "hidden" layer of intermediate units. Units are continuous-valued, typically with a superSigmoid transfer function. There can be one or several hidden layers.
Finally, the output layer contains the units whose activations are taken as the network's outputs. To apply the backpropagation algorithm, which is a supervised learning rule, a training set of inputs must be given along with the corresponding "desired outputs". The algorithm is a generalization of the Widrow-Hoff procedure: observed errors determine the direction of the successive corrections to the weights. However, since the errors at the hidden units cannot be observed directly, error measures must be propagated backward to find the correcting terms for the hidden-unit weights.
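For a single output unit, this correction is simply the Widrow-Hoff step with the sigmoid derivative as an extra factor. Here is a minimal sketch of that step (the names sigmoid, deltaRule, and the learning rate eta are illustrative, not the notebook's own definitions):

sigmoid[x_] := 1/(1 + Exp[-x]);

(* weight correction for one sigmoid unit and one training pair (x, t);
   y (1 - y) is the derivative of the sigmoid at the net input *)
deltaRule[w_, x_, t_, eta_] := Module[{y},
  y = sigmoid[w . x];
  w + eta (t - y) y (1 - y) x]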
Specifically, we will explore the simplest two-layered networks of continuous units with a sigmoid output function.
We will test the Backpropagation Algorithm on the EXOR logical function. The EXOR and the EQUIVALENCE functions partition input data into two classes that are not linearly separable.
A two-layer, three-node (2-1) feedforward network will be used (Figure 2). Note that there is also a bias input (not shown below) into each of the top three units.
We first create the training data, namely, the input and output lists for the logical function selected (the EXOR). Recall that the first component of the input vector is always constant and provides us with a bias.
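As a sketch (not necessarily the notebook's own lists), the EXOR training data might be written:

inputs  = {{1, 0, 0}, {1, 0, 1}, {1, 1, 0}, {1, 1, 1}};   (* bias, x1, x2 *)
outputs = {0, 1, 1, 0};                                   (* EXOR of x1 and x2 *)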
We will generate one input-output pair by selecting a row of the function's truth table at random. Then we will perform a forward calculation and print out the net inputs and outputs obtained with the initially selected weight values.
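A hedged sketch of this step, using the sigmoid and the inputs/outputs lists defined above; the weight names wHidden and wOutput and the random initialization range are illustrative:

SeedRandom[1];                               (* illustrative seed *)
wHidden = RandomReal[{-0.5, 0.5}, {2, 3}];   (* 2 hidden units, each fed by bias + 2 inputs *)
wOutput = RandomReal[{-0.5, 0.5}, 3];        (* output unit fed by bias + 2 hidden units *)

k = RandomInteger[{1, 4}];                   (* pick one truth-table row at random *)
x = inputs[[k]]; t = outputs[[k]];

netHidden = wHidden . x;                     (* net inputs of the hidden units *)
hidden = sigmoid /@ netHidden;
netOut = wOutput . Prepend[hidden, 1];       (* net input of the output unit *)
out = sigmoid[netOut];

Print[{netHidden, hidden, netOut, out}]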
We will repeat the procedure, and produce numPlots plots of the global cost, as the weights are being updated, to illustrate the dependency of convergence on initial conditions. All runs will converge, but at different rates.
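The following is a minimal sketch of such a run, building on the definitions above; the learning rate, the number of iterations, the value of numPlots, and the per-pattern update scheme are illustrative choices, not necessarily those of the notebook:

(* one training run: gradient descent on the squared error,
   returning the global cost after each pass over the four patterns *)
train[eta_, iterations_] := Module[{wH, wO, x, t, h, y, dO, dH, cost},
  wH = RandomReal[{-0.5, 0.5}, {2, 3}];
  wO = RandomReal[{-0.5, 0.5}, 3];
  Table[
    cost = 0;
    Do[
      x = inputs[[k]]; t = outputs[[k]];
      h = sigmoid /@ (wH . x);
      y = sigmoid[wO . Prepend[h, 1]];
      dO = (t - y) y (1 - y);           (* output correcting term *)
      dH = h (1 - h) (Rest[wO] dO);     (* hidden correcting terms, error propagated backward *)
      wO += eta dO Prepend[h, 1];
      wH += eta Outer[Times, dH, x];
      cost += (t - y)^2/2,
      {k, 4}];
    cost,
    {iterations}]]

numPlots = 5;                           (* illustrative *)
ListLinePlot[Table[train[2., 500], {numPlots}], PlotRange -> All,
  AxesLabel -> {"iteration", "global cost"}]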
[Graphics3D output]
Empirical evidence indicates that the cost function defined above has some undesirable but unavoidable properties. It is of course nonconvex and its global minimum may not be unique. It can have local minima, although experience indicates that they do not constitute a major problem: they are indeed few, since they require that all partial derivatives vanish simultaneously, and they can be bypassed by various techniques. More serious is the fact that ravines and plateaus form easily.
In ravines the derivatives are large in one direction and small in another. On plateaus the cost is high but the gradients are low, so that each step brings only minute improvements. Plateaus are a consequence of the saturation characteristics of the superSigmoid function. Significant learning occurs only when the units are in the central, near-linear, region. With large weights, large NetInputs will occur, bringing the units into the saturation zone. One possible solution is to force the weights to decrease systematically with the iterations, but this technique has the potential drawback of bringing the weights to zero, i.e., onto the guaranteed plateau around the origin, before convergence is reached.
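To see why saturation produces plateaus, one can plot the sigmoid together with its derivative (an illustrative sketch; the plot range is arbitrary). The derivative is appreciable only in the central, near-linear region and nearly vanishes for large net inputs, so saturated units contribute almost nothing to the gradient:

(* sigmoid and its derivative; the derivative is near zero outside the central region *)
Plot[{1/(1 + Exp[-x]), Exp[-x]/(1 + Exp[-x])^2}, {x, -8, 8}, PlotRange -> All]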
The consequence of these problems is that, although the procedure almost always converges, the number of iterations for convergence varies widely with the initial choice of weights. There is a zone (probably fractal in configuration) from which convergence is rapid. But other areas will require orders of magnitude more iterations to reach a solution.
It must also be realized that there is an infinity of possible solutions (ranges of weights), generally forming disconnected regions in weight space.